Introduction ¶In Phase 1 we processed our data to bring it up to the standard required for multiple linear regression modelling. We removed unnecessary columns, calculated new columns inferred from existing features, found outliers, and dropped rows with missing values. We are left with almost 3000 rows, with no missing or unusual data.
Our goal is to explore what factors influence planet discoverability by creating a Multiple Linear Regression model to predict planet radius. To facilitate this, we made two assumptions at the beginning of the study: that the distribution of exoplanets is independent of their distance from Earth, and that the radius of a planet is correlated with ease of discoverability. Our exploration suggests that the latter is true. We also assume the former holds (until proven otherwise), as it is the simplest description of our universe and the one most widely accepted by astrophysicists.
In Phase 1, exploration of the relationships between features revealed a strong link between orbital distance and orbital period, between distance from Earth and parallax, and between planet mass and radius. We also discovered that the number of exoplanets per star system drops to 1 beyond 2500 parsecs, which further supports our assumptions. Finally, exploring the positional relationship between exoplanets revealed the Kepler mission and how it dominates our dataset.
These relationships have helped inform the pre-processing we will conduct before creating the MLR model in Phase 2.
The goal of this report is to determine exoplanet radius based on certain astronomical features. We decided to use both MLR and DNN as our modelling techniques to fulfil this goal. We will also compare both models to determine which predicts our target feature more accurately.
We conduct one-hot encoding for our categorical variables, normalize our numeric features, then build a full MLR model. After performing diagnostic checks on the MLR model, we conclude that our residual distribution is bimodal, which contributes to a weaker MLR. We then use backwards feature selection to build a reduced MLR using 13 variables, which achieves a similar R-squared value to our full model: 0.43.
Concluding that our prediction accuracy may be improved upon, the report goes on to outline the creation of a deep neural network. This begins with dedicated data processing for our DNN, followed by fine-tuning of its hyperparameters using various fine-tuning plots. As a result, we achieve a significant improvement in accuracy for predicting our target feature. This is further discussed in the DNN Discussion (Literature) subchapter.
The Phase 2 report outlines the additional data processing required for multiple linear regression, full and reduced MLR fitting, and diagnostic checks for MLR modelling. Additionally, this report contains data processing for deep neural networks, neural network fitting, fine-tuning plots, a neural network discussion, critiques / limitations, and a summary of our findings.
We model our data using both Multiple Linear Regression and Neural Networks to predict the value of planet radius, in order to explore exoplanet discoverability.
Multiple Linear Regression is a model that relies on multiple explanatory variables to predict a target feature. Below you can find the full model using all of our explanatory variables, as well as a reduced model which uses backwards feature selection to remove weaker explanatory variables.
The Multiple Linear Regression model predicts planet radius with an R-squared value of approximately 0.43. This is not a strong prediction, so to develop a stronger model we used a deep neural network, explained further below.
Data Processing ¶import numpy as np
# Make NumPy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
!pip install -q seaborn
import seaborn as sns
sns.set()
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import MinMaxScaler, RobustScaler
import patsy
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
df = pd.read_csv('Phase2_Group40.csv')
print('Before Rename:', df.columns.to_list())
# Rename columns to be compatible with patsy
df.rename({'semi-major_axis': 'semi_major_axis', '2_stars': 'two_stars'}, axis=1, inplace=True)
print("After Rename:")
# Remove particular variables for better study
df = df.drop(['planet_mass', 'latitude_gal', 'longitude_gal', 'mass_ratio_sys', 'radius_ratio_sys'], axis=1 )
df.head()
Before Rename: ['num_star', '2_stars', 'orbital_period', 'semi-major_axis', 'planet_radius', 'planet_mass', 'planet_eccen', 'planet_temp', 'star_temp', 'star_radius', 'star_mass', 'star_bright', 'star_age', 'latitude_gal', 'longitude_gal', 'distance', 'parallax', 'mass_ratio_sys', 'radius_ratio_sys', 'num_planet'] After Rename:
| num_star | two_stars | orbital_period | semi_major_axis | planet_radius | planet_eccen | planet_temp | star_temp | star_radius | star_mass | star_bright | star_age | distance | parallax | num_planet | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 11688.000000 | 12.00000 | 13.400 | 0.45 | 700.0 | 7295.0 | 1.49 | 1.65 | 0.752 | 0.020 | 29.7575 | 33.5770 | 1 |
| 1 | 2 | 0 | 14.651600 | 0.11340 | 13.900 | 0.00 | 700.0 | 5172.0 | 0.94 | 0.91 | -0.197 | 5.500 | 12.5855 | 79.4274 | 5 |
| 2 | 2 | 0 | 0.736547 | 0.01544 | 1.875 | 0.05 | 1958.0 | 5172.0 | 0.94 | 0.91 | -0.197 | 10.200 | 12.5855 | 79.4274 | 5 |
| 3 | 1 | 0 | 8.463000 | 0.06450 | 4.070 | 0.00 | 593.0 | 3700.0 | 0.75 | 0.50 | -1.046 | 0.022 | 9.7221 | 102.8290 | 2 |
| 4 | 1 | 0 | 18.859019 | 0.11010 | 3.240 | 0.00 | 454.0 | 3700.0 | 0.75 | 0.50 | -1.065 | 0.022 | 9.7221 | 102.8290 | 2 |
# Generate a copy for data modification
data_encoded = df.copy()
categorical_vars = [ "num_star", "two_stars", "num_planet"]
for var in categorical_vars:
data_encoded = data_encoded.astype({var: object})
# Binary-encode columns with exactly 2 unique values
for col in data_encoded.columns:
q = len(data_encoded[col].unique())
if (q == 2):
data_encoded[col] = pd.get_dummies(data_encoded[col], drop_first=True)
# For categorical features > 2 levels
data_encoded = pd.get_dummies(data_encoded)
print(f"There are {data_encoded.shape[1]} columns with the column names {data_encoded.columns.to_list()} after one hot encoding")
There are 25 columns with the column names ['two_stars', 'orbital_period', 'semi_major_axis', 'planet_radius', 'planet_eccen', 'planet_temp', 'star_temp', 'star_radius', 'star_mass', 'star_bright', 'star_age', 'distance', 'parallax', 'num_star_1', 'num_star_2', 'num_star_3', 'num_star_4', 'num_planet_1', 'num_planet_2', 'num_planet_3', 'num_planet_4', 'num_planet_5', 'num_planet_6', 'num_planet_7', 'num_planet_8'] after one hot encoding
"""
Due to the nature of our dataset, all uint8 types are considered categorical.
"""
# Perform normalisation on only the float types in df_float.
df_float = data_encoded.select_dtypes(include=['float64'])
df_float.drop('planet_radius', inplace=True, axis=1)
print(df_float.columns.to_list())
# TODO: Check if RobustScaler gives out better results
df_norm = MinMaxScaler().fit_transform(df_float)
print(f"The mean of each column in the df_norm dataframe is {np.round(df_norm.mean(axis=0),3)}")
['orbital_period', 'semi_major_axis', 'planet_eccen', 'planet_temp', 'star_temp', 'star_radius', 'star_mass', 'star_bright', 'star_age', 'distance', 'parallax'] The mean of each column in the df_norm dataframe is [0.001 0.005 0.029 0.204 0.388 0.164 0.328 0.69 0.301 0.157 0.016]
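The TODO above can be explored with a quick comparison; a minimal sketch on hypothetical skewed data (not our dataset) showing why RobustScaler may behave better in the presence of outliers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical skewed column with one extreme outlier,
# mimicking features such as orbital_period.
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# MinMaxScaler maps to [0, 1]; the outlier squashes the bulk of
# the data into a tiny range near 0.
mm = MinMaxScaler().fit_transform(x).ravel()

# RobustScaler centres on the median and scales by the IQR, so the
# bulk of the data keeps its spread (the outlier stays extreme).
rb = RobustScaler().fit_transform(x).ravel()

print(mm)
print(rb)
```

The bulk of the MinMax-scaled values collapse below 0.005, consistent with the tiny column means printed above, while the robust-scaled bulk spans [-1, 0.5].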
# Write the normalized values back over the original numeric columns
num_cols = df_float.columns.to_list()
data_encoded.loc[:, num_cols] = pd.DataFrame(df_norm, columns=num_cols)
data_encoded.sample(3)
| two_stars | orbital_period | semi_major_axis | planet_radius | planet_eccen | planet_temp | star_temp | star_radius | star_mass | star_bright | star_age | distance | parallax | num_star_1 | num_star_2 | num_star_3 | num_star_4 | num_planet_1 | num_planet_2 | num_planet_3 | num_planet_4 | num_planet_5 | num_planet_6 | num_planet_7 | num_planet_8 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1086 | 0 | 0.002867 | 0.016460 | 2.900 | 0.000000 | 0.041529 | 0.359153 | 0.127186 | 0.278810 | 0.637748 | 0.334286 | 0.210627 | 0.003872 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 170 | 0 | 0.000057 | 0.001267 | 14.022 | 0.098925 | 0.316433 | 0.479748 | 0.178060 | 0.412639 | 0.744182 | 0.028571 | 0.135397 | 0.005789 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1131 | 0 | 0.000106 | 0.001874 | 1.140 | 0.000000 | 0.228025 | 0.458574 | 0.182830 | 0.379182 | 0.715777 | 0.242143 | 0.171407 | 0.004661 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Statistical Modelling ¶Our first attempt at modelling our data will be a Multiple Linear Regression model using all of our explanatory variables.
Here are the variables we will be using for this particular model:
df.columns.tolist()
import warnings
warnings.filterwarnings("ignore")
from tabulate import tabulate
table = [['Name','Data Type','Units','Description'],
['num_star', 'Nominal categorical', 'NA', 'Number of stars in the system'],
['num_planet', 'Nominal categorical', 'NA', 'Number of planets in the system'],
['two_stars', 'Nominal categorical', 'NA', 'Circumbinary flag: whether the planet orbits 2 stars'],
['orbital_period', 'Numeric', 'Earth days', 'Orbital period (time it takes the planet to complete an orbit)'],
['semi_major_axis', 'Numeric', 'au', 'Orbit semi-major axis; 1 au is the Earth-Sun distance'],
['planet_radius', 'Numeric', 'Earth radius', 'Planet radius, where 1.0 is Earth\'s radius'],
['planet_eccen', 'Numeric', 'NA', 'Planet\'s orbital eccentricity'],
['planet_temp', 'Numeric', 'Kelvin', 'Equilibrium temperature (the theoretical temperature of a planet treated as a black body heated only by its parent star)'],
['star_temp', 'Numeric', 'Kelvin', 'Stellar Effective Temperature'],
['star_radius', 'Numeric', 'Solar radius', 'Stellar radius, where 1.0 is our Sun\'s radius'],
['star_mass', 'Numeric', 'Solar mass', 'Stellar mass, where 1.0 is our Sun\'s mass'],
['star_bright', 'Numeric', 'log(Solar luminosity)', 'Stellar luminosity'],
['star_age', 'Numeric', 'Gyr (gigayears)', 'Stellar age'],
['distance', 'Numeric', 'parsec', 'Distance from Earth'],
['parallax', 'Numeric', 'mas (milliarcseconds)', 'Parallax: apparent shift of the star relative to background objects in the night sky'],
]
print(tabulate(table, headers='firstrow', tablefmt='simple'))
print("\nOur target variable is planet_radius.")
Name             Data Type            Units                  Description
---------------  -------------------  ---------------------  -----------------------------------------------------------------------------------
num_star         Nominal categorical  NA                     Number of stars in the system
num_planet       Nominal categorical  NA                     Number of planets in the system
two_stars        Nominal categorical  NA                     Circumbinary flag: whether the planet orbits 2 stars
orbital_period   Numeric              Earth days             Orbital period (time it takes the planet to complete an orbit)
semi_major_axis  Numeric              au                     Orbit semi-major axis; 1 au is the Earth-Sun distance
planet_radius    Numeric              Earth radius           Planet radius, where 1.0 is Earth's radius
planet_eccen     Numeric              NA                     Planet's orbital eccentricity
planet_temp      Numeric              Kelvin                 Equilibrium temperature (the theoretical temperature of a planet treated as a black body heated only by its parent star)
star_temp        Numeric              Kelvin                 Stellar effective temperature
star_radius      Numeric              Solar radius           Stellar radius, where 1.0 is our Sun's radius
star_mass        Numeric              Solar mass             Stellar mass, where 1.0 is our Sun's mass
star_bright      Numeric              log(Solar luminosity)  Stellar luminosity
star_age         Numeric              Gyr (gigayears)        Stellar age
distance         Numeric              parsec                 Distance from Earth
parallax         Numeric              mas (milliarcseconds)  Parallax: apparent shift of the star relative to background objects in the night sky

Our target variable is planet_radius.
formula_string_indep_vars_encoded = ' + '.join(data_encoded.drop(columns='planet_radius').columns)
formula_string_encoded = 'planet_radius ~ ' + formula_string_indep_vars_encoded
print('formula_string_encoded: ', formula_string_encoded)
formula_string_encoded: planet_radius ~ two_stars + orbital_period + semi_major_axis + planet_eccen + planet_temp + star_temp + star_radius + star_mass + star_bright + star_age + distance + parallax + num_star_1 + num_star_2 + num_star_3 + num_star_4 + num_planet_1 + num_planet_2 + num_planet_3 + num_planet_4 + num_planet_5 + num_planet_6 + num_planet_7 + num_planet_8
Fitting the OLS model to the encoded data:
model_full = sm.formula.ols(formula=formula_string_encoded, data=data_encoded)
model_full_fitted = model_full.fit()
print(model_full_fitted.summary())
OLS Regression Results
==============================================================================
Dep. Variable: planet_radius R-squared: 0.429
Model: OLS Adj. R-squared: 0.424
Method: Least Squares F-statistic: 97.92
Date: Sun, 24 Oct 2021 Prob (F-statistic): 0.00
Time: 23:41:30 Log-Likelihood: -7750.8
No. Observations: 2895 AIC: 1.555e+04
Df Residuals: 2872 BIC: 1.568e+04
Df Model: 22
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 0.9042 0.960 0.942 0.346 -0.978 2.787
two_stars -1.9406 2.066 -0.939 0.348 -5.991 2.110
orbital_period -29.4813 12.582 -2.343 0.019 -54.152 -4.810
semi_major_axis 37.6524 11.889 3.167 0.002 14.341 60.964
planet_eccen 10.1045 0.736 13.734 0.000 8.662 11.547
planet_temp 10.8496 0.765 14.185 0.000 9.350 12.349
star_temp -5.4291 2.064 -2.630 0.009 -9.476 -1.382
star_radius 8.5836 1.918 4.475 0.000 4.823 12.345
star_mass 16.5155 1.847 8.943 0.000 12.895 20.137
star_bright -4.8203 2.210 -2.181 0.029 -9.153 -0.487
star_age 0.8674 0.360 2.410 0.016 0.162 1.573
distance -5.3557 0.792 -6.766 0.000 -6.908 -3.804
parallax 0.6445 1.442 0.447 0.655 -2.182 3.471
num_star_1 -0.9594 0.873 -1.098 0.272 -2.672 0.753
num_star_2 1.2904 0.880 1.466 0.143 -0.436 3.016
num_star_3 0.7061 1.041 0.678 0.498 -1.335 2.747
num_star_4 -0.1328 3.261 -0.041 0.968 -6.527 6.262
num_planet_1 1.6242 0.304 5.348 0.000 1.029 2.220
num_planet_2 -0.0999 0.315 -0.317 0.751 -0.718 0.518
num_planet_3 -0.2115 0.329 -0.642 0.521 -0.857 0.434
num_planet_4 -0.7739 0.365 -2.120 0.034 -1.490 -0.058
num_planet_5 -0.4839 0.439 -1.103 0.270 -1.345 0.377
num_planet_6 -1.4795 0.672 -2.201 0.028 -2.797 -0.162
num_planet_7 0.9197 1.291 0.712 0.476 -1.612 3.452
num_planet_8 1.4090 1.218 1.156 0.248 -0.980 3.798
==============================================================================
Omnibus: 313.499 Durbin-Watson: 1.055
Prob(Omnibus): 0.000 Jarque-Bera (JB): 504.267
Skew: 0.768 Prob(JB): 3.16e-110
Kurtosis: 4.350 Cond. No. 9.93e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.42e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Visualizing the accuracy of our model by plotting actual radius vs. predicted radius
residuals_full = pd.DataFrame({'actual': df['planet_radius'],
'predicted': model_full_fitted.fittedvalues,
'residual': model_full_fitted.resid})
figure(figsize=(11, 4), dpi=300)
def plot_line(axis, slope, intercept, **kargs):
xmin, xmax = axis.get_xlim()
plt.plot([xmin, xmax], [xmin*slope+intercept, xmax*slope+intercept], **kargs)
plt.scatter(residuals_full['actual'], residuals_full['predicted'], alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="red");
plt.xlabel('Actual Radius');
plt.ylabel('Predicted Radius');
plt.title('Figure 9: Scatter plot of actual vs. predicted radius for the full Model', fontsize=15);
plt.show();
We would like to check whether there are indications of violations of the regression assumptions, which are: linearity, independence of errors, homoscedasticity (constant error variance), and normality of the residuals.
figure(figsize=(11, 4), dpi=200)
plt.scatter(residuals_full['predicted'], residuals_full['residual'], alpha=0.3);
plt.xlabel('Predicted Radius');
plt.ylabel('Residuals')
plt.title('Figure 10(a): Scatterplot of residuals vs. predicted Radius for Full Model', fontsize=15)
plt.show();
From this plot we see that the residuals exhibit a banding pattern, especially when Predicted Radius is below 10. The impact of the Kepler mission (as explored in Phase 1) can also be seen in this plot. The majority of data points which make up the left-most hotspot seem to be over-estimated. Based on our previous exploration, we can assume that these data points are from the Kepler mission, which had greater success finding smaller, Earth-like exoplanets.
figure(figsize=(11, 4), dpi=200)
plt.scatter(residuals_full['actual'], residuals_full['residual'], alpha=0.3);
plt.xlabel('Actual Radius');
plt.ylabel('Residuals')
plt.title('Figure 10(b): Scatterplot of residuals vs. actual Radius for Full Model', fontsize=15)
plt.show();
The shape of this plot shows that the model over-estimates the radius of small exoplanets.
figure(figsize=(11, 4), dpi=300)
plt.hist(residuals_full['actual'], label='actual', bins=20, alpha=0.7);
plt.hist(residuals_full['predicted'], label='predicted', bins=20, alpha=0.7);
plt.xlabel('Radius');
plt.ylabel('Frequency');
plt.title('Figure 11: Histograms of actual Radius vs. predicted Radius for Full Model', fontsize=15);
plt.legend()
plt.show();
This histogram shows that our original dataset has two peaks: the left from Kepler, and the right from all other missions. Our model settles on a middle ground, which causes significant inaccuracy.
This violates the normality assumption for our residual distribution, which may cause our MLR model to be significantly weaker than expected.
figure(figsize=(11, 4), dpi=300)
plt.hist(residuals_full['residual'], bins = 20);
plt.xlabel('Residual');
plt.ylabel('Frequency');
plt.title('Figure 12: Histogram of residuals for Full Model', fontsize=15);
plt.show();
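The visual evidence of non-normality can be backed by a formal test; the OLS summary above already reports Jarque-Bera statistics, and the same test is available directly from scipy. A minimal sketch on hypothetical residuals (our actual residuals live in residuals_full['residual']):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical bimodal residuals, similar in shape to Figure 12,
# versus well-behaved normal residuals.
bimodal = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
gaussian = rng.normal(0.0, 1.0, 1000)

# Jarque-Bera checks skewness and kurtosis against a normal
# distribution; a tiny p-value rejects the normality assumption.
jb_bimodal = stats.jarque_bera(bimodal)
jb_gaussian = stats.jarque_bera(gaussian)
print(f"bimodal:  JB={jb_bimodal.statistic:.1f}  p={jb_bimodal.pvalue:.3g}")
print(f"gaussian: JB={jb_gaussian.statistic:.1f}  p={jb_gaussian.pvalue:.3g}")
```

The bimodal sample is decisively rejected, mirroring the near-zero Prob(JB) reported for our full model.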
Performing backwards feature selection.
## create the patsy model description from formula
patsy_description = patsy.ModelDesc.from_formula(formula_string_encoded)
# initialize feature-selected fit to full model
linreg_fit = model_full_fitted
# do backwards elimination using p-values
p_val_cutoff = 0.05
## WARNING 1: The code below assumes that the Intercept term is present in the model.
## WARNING 2: It will work only with main effects and two-way interactions, if any.
print('\nPerforming backwards feature selection using p-values:')
to_remove = []
while True:
# uncomment the line below if you would like to see the regression summary
# in each step:
# print(linreg_fit.summary())
pval_series = linreg_fit.pvalues.drop(labels='Intercept')
pval_series = pval_series.sort_values(ascending=False)
term = pval_series.index[0]
pval = pval_series.iloc[0]
if (pval < p_val_cutoff):
break
term_components = term.split(':')
print(f'\nRemoving term "{term}" with p-value {pval:.4}')
to_remove.append(str(term))
if (len(term_components) == 1): ## this is a main effect term
patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0])]))
else: ## this is an interaction term
patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0]),
patsy.EvalFactor(term_components[1])]))
linreg_fit = smf.ols(formula=patsy_description, data=data_encoded).fit()
###
## this is the clean fit after backwards elimination
model_reduced_fitted = smf.ols(formula=patsy_description, data=data_encoded).fit()
print("To remove list:", to_remove, "\n")
###
#########
print("\n***")
print(model_reduced_fitted.summary())
print("***")
print(f"Regression number of terms: {len(model_reduced_fitted.model.exog_names)}")
print(f"Regression F-distribution p-value: {model_reduced_fitted.f_pvalue:.4f}")
print(f"Regression R-squared: {model_reduced_fitted.rsquared:.4f}")
print(f"Regression Adjusted R-squared: {model_reduced_fitted.rsquared_adj:.4f}")
Performing backwards feature selection using p-values:
Removing term "num_star_4" with p-value 0.9675
Removing term "num_star_3" with p-value 0.8402
Removing term "num_planet_2" with p-value 0.9379
Removing term "parallax" with p-value 0.6551
Removing term "num_planet_3" with p-value 0.6294
Removing term "num_star_2" with p-value 0.4447
Removing term "num_planet_7" with p-value 0.405
Removing term "num_planet_5" with p-value 0.3469
Removing term "num_planet_8" with p-value 0.2432
Removing term "two_stars" with p-value 0.2138
Removing term "num_planet_6" with p-value 0.07032
To remove list: ['num_star_4', 'num_star_3', 'num_planet_2', 'parallax', 'num_planet_3', 'num_star_2', 'num_planet_7', 'num_planet_5', 'num_planet_8', 'two_stars', 'num_planet_6']
***
OLS Regression Results
==============================================================================
Dep. Variable: planet_radius R-squared: 0.427
Model: OLS Adj. R-squared: 0.424
Method: Least Squares F-statistic: 165.1
Date: Sun, 24 Oct 2021 Prob (F-statistic): 0.00
Time: 23:41:39 Log-Likelihood: -7755.3
No. Observations: 2895 AIC: 1.554e+04
Df Residuals: 2881 BIC: 1.562e+04
Df Model: 13
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 2.1426 0.801 2.675 0.008 0.572 3.713
orbital_period -29.8658 12.516 -2.386 0.017 -54.407 -5.324
semi_major_axis 38.0820 11.817 3.223 0.001 14.912 61.252
planet_eccen 10.1191 0.732 13.831 0.000 8.685 11.554
planet_temp 10.9099 0.761 14.336 0.000 9.418 12.402
star_temp -5.1539 2.048 -2.516 0.012 -9.170 -1.138
star_radius 8.9757 1.878 4.780 0.000 5.294 12.658
star_mass 16.2970 1.832 8.897 0.000 12.705 19.889
star_bright -5.3274 2.103 -2.534 0.011 -9.450 -1.204
star_age 0.7534 0.357 2.111 0.035 0.054 1.453
distance -5.3335 0.774 -6.891 0.000 -6.851 -3.816
num_star_1 -2.0984 0.276 -7.614 0.000 -2.639 -1.558
num_planet_1 1.8008 0.145 12.421 0.000 1.516 2.085
num_planet_4 -0.5768 0.282 -2.043 0.041 -1.130 -0.023
==============================================================================
Omnibus: 317.735 Durbin-Watson: 1.058
Prob(Omnibus): 0.000 Jarque-Bera (JB): 512.766
Skew: 0.775 Prob(JB): 4.51e-112
Kurtosis: 4.360 Cond. No. 463.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
***
Regression number of terms: 14
Regression F-distribution p-value: 0.0000
Regression R-squared: 0.4269
Regression Adjusted R-squared: 0.4243
residuals_reduced = pd.DataFrame({'actual': df['planet_radius'],
'predicted': model_reduced_fitted.fittedvalues,
'residual': model_reduced_fitted.resid})
# Creating scatter plot for reduced model
figure(figsize=(11, 4), dpi=300)
plt.scatter(residuals_reduced['actual'], residuals_reduced['predicted'], alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="red");
plt.xlabel('Actual Radius');
plt.ylabel('Predicted Radius');
plt.title('Figure 13: Scatter plot of actual vs. predicted radius for the Reduced Model', fontsize=15);
plt.show();
Deep Neural Networks (DNN) ¶import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Make NumPy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
print(tf.__version__)
2.6.0
To make our DNN more precise, we need to further control the amount of data passed through our neural network. This involves dropping features that do not contribute significantly to our target. Please note that the results for both the MLR and the DNN have been scaled equally to allow for a fair comparison. This will become apparent in the Fine Tuning Plots subchapter.
orig_data_dnn = df.copy()
categorical_vars = [ "num_star", "two_stars", "num_planet"]
for var in categorical_vars:
orig_data_dnn = orig_data_dnn.astype({var: object})
# Binary-encode columns with exactly 2 unique values
for col in orig_data_dnn.columns:
q = len(orig_data_dnn[col].unique())
if (q == 2):
orig_data_dnn[col] = pd.get_dummies(orig_data_dnn[col], drop_first=True)
# For categorical features > 2 levels
orig_data_dnn = pd.get_dummies(orig_data_dnn)
print(f"There are {orig_data_dnn.shape[1]} columns with the column names {orig_data_dnn.columns.to_list()} after one hot encoding")
orig_data_dnn.shape
There are 25 columns with the column names ['two_stars', 'orbital_period', 'semi_major_axis', 'planet_radius', 'planet_eccen', 'planet_temp', 'star_temp', 'star_radius', 'star_mass', 'star_bright', 'star_age', 'distance', 'parallax', 'num_star_1', 'num_star_2', 'num_star_3', 'num_star_4', 'num_planet_1', 'num_planet_2', 'num_planet_3', 'num_planet_4', 'num_planet_5', 'num_planet_6', 'num_planet_7', 'num_planet_8'] after one hot encoding
(2895, 25)
Contrary to our initial beliefs, the DNN failed to perform appropriately when we filtered out outliers. In some cases, our NN performed so poorly it was almost impossible to graph. This is predominantly because the outlier check removed two thirds of our dataset, leaving only highly biased data from the Kepler mission. The lack of variety, the large number of columns, and the much smaller dataset led to our NN making constant, obvious (and uninteresting) predictions. To handle our outliers instead, we chose an outlier-friendly optimizer and loss, which will be further expanded upon in the literature.
# # Outlier filter
# def set_outlier_nan(df):
# """
# - Finds outliers and sets their values to NaN to be processed later.
# - Excluded columns involves categories to be excluded from the outlier check
# """
# # excluded_columns = [
# # 'num_star',
# # 'num_planet',
# # 'two_stars',
# # 'longitude_gal',
# # 'latitude_gal',
# # 'parallax',
# # 'distance',
# # ]
# for column_name in df.columns:
# # conditional to exclude certain columns from the outlier check
# # if column_name in excluded_columns:
# # continue
# # else:
# column = df[column_name]
# q1 = column.quantile(0.25)
# q3 = column.quantile(0.75)
# iqr = column.quantile(0.75) - column.quantile(0.25)
# lower = q1 - 3 * iqr
# upper = q3 + 3 * iqr
# num_column_outliers = df[(column > upper) | (column < lower)]\
# .shape[0]
# # set rows that exceeds outlier parameters to none
# df[(column > upper) | (column < lower)] = np.nan
# return df
# orig_data_dnn = set_outlier_nan(df=orig_data_dnn)
# print(
# f"""
# The outlier check will get rid of {orig_data_dnn["planet_radius"].isna().sum()} planets.
# """)
# orig_data_dnn = orig_data_dnn.dropna()
# print(f"The dataset now has {orig_data_dnn.shape[0]} planets")
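The outlier-robust choice of mean-absolute-error loss (used when compiling the model below) can be motivated with a small worked example on hypothetical prediction errors: one large outlier inflates a squared-error loss far more than an absolute-error loss.

```python
import numpy as np

# Hypothetical prediction errors: four typical, one outlier.
clean = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
dirty = np.array([1.0, 1.0, 1.0, 1.0, 10.0])

mae_clean, mae_dirty = np.abs(clean).mean(), np.abs(dirty).mean()
mse_clean, mse_dirty = (clean ** 2).mean(), (dirty ** 2).mean()

# MAE grows 1.0 -> 2.8 while MSE grows 1.0 -> 20.8: under a squared
# loss a single outlier dominates the gradient, which MAE avoids.
print(mae_clean, mae_dirty)  # 1.0 2.8
print(mse_clean, mse_dirty)  # 1.0 20.8
```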
The following code removes unnecessary columns, both via the p-value-based feature selection performed above and via manual selection, to make our NN more accurate.
data_dnn = orig_data_dnn.copy()
target_df = data_dnn['planet_radius'].values.reshape(-1, 1)
target_norm = MinMaxScaler().fit_transform(target_df)
data_dnn.loc[:,
['planet_radius']
] = pd.DataFrame(target_norm, columns=['planet_radius'])
# Apply feature selection
for col_name in to_remove:
data_dnn.drop(col_name, axis=1, inplace=True)
# Remove unnecessary columns to increase precision
unnecessary_cols = [
'semi_major_axis',
'planet_eccen',
'star_mass',
'star_bright',
'num_star_1',
'num_planet_1',
'num_planet_4',
]
for col_name in unnecessary_cols:
data_dnn.drop(col_name, axis=1, inplace=True)
data_dnn.sample(3)
| orbital_period | planet_radius | planet_temp | star_temp | star_radius | star_age | distance | |
|---|---|---|---|---|---|---|---|
| 6 | 1.508956 | 0.543323 | 1898.0 | 5950.0 | 1.11 | 1.60 | 787.909 |
| 1916 | 10.290994 | 0.446940 | 1100.0 | 5800.0 | 1.35 | 7.00 | 1017.800 |
| 1581 | 3.468095 | 0.035572 | 1098.0 | 5580.0 | 0.86 | 2.75 | 1226.850 |
Partition the dataset into training and testing sets.
train_dataset = data_dnn.sample(frac=0.8, random_state=0)
test_dataset = data_dnn.drop(train_dataset.index)
# Report the DNN dataframe's shape (data_encoded is the 25-column MLR frame)
print(
f"""
--- Dataset Sizes ---
Original Dataset: {data_dnn.shape}
Training Dataset: {train_dataset.shape}
Testing Dataset: {test_dataset.shape}
"""
)
--- Dataset Sizes --- Original Dataset: (2895, 7) Training Dataset: (2316, 7) Testing Dataset: (579, 7)
train_dataset.describe().transpose()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| orbital_period | 2316.0 | 96.845960 | 1781.304729 | 0.280324 | 3.965398 | 9.112331 | 21.405835 | 69000.000000 |
| planet_radius | 2316.0 | 0.133587 | 0.153514 | 0.000000 | 0.042859 | 0.069687 | 0.118790 | 0.764275 |
| planet_temp | 2316.0 | 925.511658 | 454.383865 | 125.000000 | 580.750000 | 836.500000 | 1170.250000 | 3186.000000 |
| star_temp | 2316.0 | 5515.322936 | 707.367629 | 2566.000000 | 5190.750000 | 5653.000000 | 5961.250000 | 9360.000000 |
| star_radius | 2316.0 | 1.040203 | 0.414777 | 0.010000 | 0.810000 | 0.960000 | 1.210000 | 6.300000 |
| star_age | 2316.0 | 4.194171 | 2.752552 | 0.012000 | 2.600000 | 3.890000 | 4.802500 | 14.000000 |
| distance | 2316.0 | 706.413402 | 479.209174 | 3.290000 | 342.143000 | 641.327500 | 969.418500 | 4483.050000 |
train_features = train_dataset.copy()
test_features = test_dataset.copy()
train_labels = train_features.pop('planet_radius')
test_labels = test_features.pop('planet_radius')
train_dataset.describe().transpose()[['mean', 'std']]
| mean | std | |
|---|---|---|
| orbital_period | 96.845960 | 1781.304729 |
| planet_radius | 0.133587 | 0.153514 |
| planet_temp | 925.511658 | 454.383865 |
| star_temp | 5515.322936 | 707.367629 |
| star_radius | 1.040203 | 0.414777 |
| star_age | 4.194171 | 2.752552 |
| distance | 706.413402 | 479.209174 |
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_features))
print(normalizer.mean.numpy())
[[ 96.846 925.512 5515.323 1.04 4.194 706.413]]
first = np.array(train_features[:1])
with np.printoptions(precision=2, suppress=True):
print('First example:', first)
print()
print('Normalized:', normalizer(first).numpy())
First example: [[ 3.48 1156. 5418. 0.93 8.71 450. ]] Normalized: [[-0.05 0.51 -0.14 -0.27 1.64 -0.54]]
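As a sanity check, the normalized example above can be reproduced by hand from the mean/std table: each feature is transformed as (x - mean) / std. (Keras's Normalization layer uses the population standard deviation, which differs negligibly from the pandas sample std at n = 2316.)

```python
import numpy as np

# Training means and stds copied from train_dataset.describe() above:
# orbital_period, planet_temp, star_temp, star_radius, star_age, distance.
mean = np.array([96.845960, 925.511658, 5515.322936, 1.040203, 4.194171, 706.413402])
std = np.array([1781.304729, 454.383865, 707.367629, 0.414777, 2.752552, 479.209174])

# The "First example" row printed above.
first = np.array([3.48, 1156.0, 5418.0, 0.93, 8.71, 450.0])

normalized = (first - mean) / std
print(np.round(normalized, 2))  # matches the layer's output above
```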
Functions for plotting graphs later on.
def plot_loss(history):
figure(figsize=(15, 4), dpi=150)
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.title("Fig. 14: Loss Analysis")
plt.xlabel('Epoch')
plt.ylabel('Error [Planet Radius]')
plt.legend()
plt.grid(True)
def plot_history(history):
figure(figsize=(15, 4), dpi=150)
plt.plot(history.history['mean_squared_error'])
plt.plot(history.history['val_mean_squared_error'])
plt.title('Fig. 15: Mean Squared Error over Epoch')
plt.ylabel('Mean Squared Error [Planet Radius]')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='lower right')
plt.show()
Builds and compiles the model structure. Further discussed in literature.
def build_and_compile_model(norm):
    model = keras.Sequential([
        norm,
        layers.Dense(24, activation='relu', name='layer1'),
        layers.Dropout(0.015),
        layers.Dense(24, activation='relu', name='layer2'),
        layers.Dropout(0.00),
        layers.Dense(1, activation='linear', name='output_layer')
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.0007)
    # optimizer = tf.keras.optimizers.SGD(learning_rate=10e-4, decay=1e-6, momentum=0.5)
    # For loss justification: https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/
    model.compile(loss='mean_absolute_error',
                  optimizer=optimizer,
                  # https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b
                  metrics=[tf.keras.metrics.MeanSquaredError()])
    return model
dnn_model = build_and_compile_model(normalizer)
dnn_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                  Output Shape             Param #
=================================================================
normalization (Normalization) (None, 6)                13
_________________________________________________________________
layer1 (Dense)                (None, 24)               168
_________________________________________________________________
dropout (Dropout)             (None, 24)               0
_________________________________________________________________
layer2 (Dense)                (None, 24)               600
_________________________________________________________________
dropout_1 (Dropout)           (None, 24)               0
_________________________________________________________________
output_layer (Dense)          (None, 1)                25
=================================================================
Total params: 806
Trainable params: 793
Non-trainable params: 13
_________________________________________________________________
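The parameter counts in the summary can be reproduced by hand: each Dense layer holds in × out weights plus out biases, and the Normalization layer stores a non-trainable mean and variance per feature plus one sample count. A quick check:

```python
# Reproduce the per-layer parameter counts from the model summary
n_in, n_hidden = 6, 24
layer1 = n_in * n_hidden + n_hidden        # weights + biases
layer2 = n_hidden * n_hidden + n_hidden
output = n_hidden * 1 + 1
norm_params = 2 * n_in + 1                 # mean + variance per feature, plus a count (non-trainable)
total = layer1 + layer2 + output + norm_params
print(layer1, layer2, output, norm_params, total)  # 168 600 25 13 806
```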
Train the model based off the training data. Further discussed in literature.
%%time
history = dnn_model.fit(
    train_features,
    train_labels,
    validation_split=0.2,
    verbose=0, epochs=150)
Wall time: 21 s
Key plots that denote model performance, which are then used to fine-tune our network. Although this process is discussed in depth in our literature, a short summary underneath each graph highlights the intention behind every plot.
plot_loss(history)
Figure 14 demonstrates the loss analysis graph for our current DNN. The graph shows the prediction error for our target feature, planet radius, as the network gradually learns (over epochs). Loss analysis is therefore key to determining the overall learning performance of a neural network. To elaborate, loss is the value the neural network tries to minimise over time; a lower loss value indicates more accurate predictions. Consequently, the DNN learns by readjusting its nodes' weights and biases in a manner that reduces the loss. val_loss is computed on the held-out validation split (20% of the training data) and is a good indication of how the network will perform on data it has never seen before. (1)
Our loss analysis graph indicates good learning progress for our DNN. Firstly, both loss curves take the shape of an inverse logarithm, demonstrating strong learning performance as the epochs progress. Secondly, the error values for both curves can be considered quite low (in terms of astronomical predictions), highlighting high accuracy when predicting our target feature. Finally, our loss and val_loss curves sit very close together. This illustrates our DNN learning the data appropriately, rather than merely memorising the training data or failing to find any connections within the dataset. Despite these positives, there is a minor issue of the curves becoming quite jagged, indicating a lack of confidence in our DNN's predictions. This, however, may be a limitation of the dataset we are working with (see the discussion subchapter for how this was reduced).
Overall, Fig. 14 displays the strong learning progress of our DNN as the epochs progress. However, the curves do come out slightly jagged, which may show our DNN struggling to learn at certain stages.
plot_history(history)
Figure 15 denotes the mean squared error (MSE) over epochs for our DNN. "MSE is an absolute measure of how well the model fits" (2), and is given by the following formula: MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², where yᵢ is the true value, ŷᵢ the prediction and n the number of samples.
Given this, Fig. 15 closely follows the shape of Fig. 14, confirming strong DNN performance over both the training and validation data. In analysis, our MSE graph signifies a very high accuracy rate for our current DNN model. The low MSE values indicate how little the predicted results (in both training and validation) deviate from the actual numbers. However, the slightly jagged curves do highlight our model struggling to learn in certain cases (as mentioned previously). Again, this is likely a limitation of the dataset that was used.
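A small worked example of MSE on toy numbers (not our data):

```python
import numpy as np

# MSE = (1/n) * sum((y_i - yhat_i)^2), on three toy radius values
y_true = np.array([0.10, 0.25, 0.40])
y_pred = np.array([0.12, 0.20, 0.45])
mse = np.mean((y_true - y_pred) ** 2)
print(round(float(mse), 4))  # 0.0018
```

Squaring the residuals (−0.02, 0.05, −0.05) penalises large misses more heavily than MAE does, which is why we use MAE as the training loss but track MSE as a metric.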
Output MSE data.
# Gather Results and Predict data via neural network
test_results = pd.DataFrame([])
test_results['dnn_model'] = dnn_model.evaluate(test_features, test_labels, verbose=1)
19/19 [==============================] - 0s 2ms/step - loss: 0.0688 - mean_squared_error: 0.0169
Create a prediction based off our test data.
test_predictions = dnn_model.predict(test_features).flatten()
Normalise the MLR data to the same scale so accurate comparisons can be made.
# Normalise the residuals_reduced df results between 0 and 1 for performance analysis.
# mlr_error_reshaped = residual.values.reshape(-1, 1)
residuals_reduced_norm = pd.DataFrame(MinMaxScaler().fit_transform(residuals_reduced), columns=residuals_reduced.columns)
residuals_reduced_norm.sample(3)
| | actual | predicted | residual |
|---|---|---|---|
| 2770 | 0.439421 | 0.493370 | 0.429738 |
| 880 | 0.102146 | 0.168399 | 0.354297 |
| 799 | 0.022655 | 0.259938 | 0.222285 |
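The min-max normalisation applied above can be sketched in a few lines (illustrative values only, not our residuals):

```python
import numpy as np

def min_max_scale(col):
    # (x - min) / (max - min): maps the column onto [0, 1]
    return (col - col.min()) / (col.max() - col.min())

x = np.array([2.0, 4.0, 10.0])
print(min_max_scale(x))  # [0.   0.25 1.  ]
```

This is what MinMaxScaler().fit_transform does per column, which is why both models' outputs can be compared on a common [0, 1] scale.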
from scipy.stats import norm, t
sns.set_theme(style="darkgrid")
# Key stats
dnn_error = test_predictions - test_labels
mlr_error = residuals_reduced["predicted"] - residuals_reduced["actual"]
mlr_error_reshaped = mlr_error.values.reshape(-1, 1)
mlr_norm = MinMaxScaler().fit_transform(mlr_error_reshaped)
std_df_mlr = pd.DataFrame({"value": residuals_reduced_norm['predicted'].std(), "Model": "MLR", "type":"Standard Deviation"}, index=[0])
std_df_dnn = pd.DataFrame({"value": test_predictions.std(), "Model": "DNN", "type":"Standard Deviation"}, index=[0])
mean_df_mlr = pd.DataFrame({"value": residuals_reduced_norm['predicted'].mean(), "Model": "MLR", "type":"Mean"}, index=[0])
mean_df_dnn = pd.DataFrame({"value": test_predictions.mean(), "Model": "DNN", "type":"Mean"}, index=[0])
area_df_mlr = pd.DataFrame({"value": norm.cdf(x=mlr_norm.mean()+0.05, loc=mlr_norm.mean(), scale=mlr_norm.std()) - norm.cdf(x=mlr_norm.mean()-0.05, loc=mlr_norm.mean(), scale=mlr_norm.std()), "Model": "MLR", "type":"Tightness of\n Normalised Bell Curve\n(lower is better)\n[Area between -0.05 to 0.05]"}, index=[0])
# loc corrected: the original passed dnn_error.std() as the mean of the fitted normal
area_df_dnn = pd.DataFrame({"value": norm.cdf(x=dnn_error.mean()+0.05, loc=dnn_error.mean(), scale=dnn_error.std()) - norm.cdf(x=dnn_error.mean()-0.05, loc=dnn_error.mean(), scale=dnn_error.std()), "Model": "DNN", "type":"Tightness of\n Normalised Bell Curve\n(lower is better)\n[Area between -0.05 to 0.05]"}, index=[0])
models_stats = pd.concat([
std_df_mlr,
std_df_dnn,
mean_df_mlr,
mean_df_dnn,
area_df_mlr,
area_df_dnn
])
figure(figsize=(14, 4), dpi=150)
sns.barplot(data=models_stats, x='type', y='value', hue='Model', palette='pastel')
plt.title("Fig. 16: Key Statistics for both Predicted Models")
plt.xlabel("Statistical Type")
plt.ylabel("Units of Measurement")
plt.show()
Figure 16 displays key statistics for both our MLR and DNN models. We will go through them one by one.
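The "Tightness of Normalised Bell Curve" statistic above is the probability mass of a fitted normal distribution within ±0.05 of its mean. A dependency-free sketch of the same quantity (the function name is ours), using the identity P(|X − μ| < h) = erf(h / (σ√2)):

```python
import math

def area_within(mean, std, half_width=0.05):
    # Probability that a N(mean, std^2) variable lands within
    # +/- half_width of its mean, via the error function
    return math.erf(half_width / (std * math.sqrt(2)))

print(round(area_within(0.0, 1.0), 4))
```

Note that a smaller σ concentrates more mass in the interval, so this area grows as the residual distribution tightens.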
test_predictions = dnn_model.predict(test_features).flatten()
sns.set_theme(style="darkgrid")
figure(figsize=(4, 4), dpi=150)
# a = plt.axes(aspect='equal')
sns.histplot(data=residuals_reduced_norm, x='actual', y='predicted', bins=100, cmap="crest", label='MLR model')
sns.kdeplot(data=residuals_reduced_norm, x='actual', y='predicted', levels=4, color="green", linewidths=0.5)
sns.histplot(x=test_labels, y=test_predictions, bins=100, cmap="flare", label='DNN model')
sns.kdeplot(x=test_labels, y=test_predictions, levels=4, color="red", linewidths=0.5)
plt.title("Fig. 17a\nPrediction Performance over both Models\n (Exoplanet Radius)")
plt.xlabel('True Values [Planet Radius]')
plt.ylabel('Predictions [Planet Radius]')
lims = [0, 1]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims, color='black')
plt.legend(labels=["Reference", "MLR model", "DNN model"], title="Plot Labels", title_fontsize="8", loc=2, bbox_to_anchor=(1, 1))
plt.show()
figure(figsize=(4, 4), dpi=150)
# a = plt.axes(aspect='equal')
sns.histplot(data=residuals_reduced_norm, x='actual', y='predicted', bins=300, cmap="crest")
sns.kdeplot(data=residuals_reduced_norm, x='actual', y='predicted', levels=3, color="green", linewidths=1)
sns.histplot(x=test_labels, y=test_predictions, bins=300, cmap="flare")
sns.kdeplot(x=test_labels, y=test_predictions, levels=5, color="red", linewidths=1)
plt.title("Fig. 17b\nPrediction Performance over both Models\n (Exoplanet Radius)\n [Zoomed towards Kepler cluster]\n")
plt.xlabel('True Values [Planet Radius]')
plt.ylabel('Predictions [Planet Radius]')
lims = [0, 0.3]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims, color='black')
plt.legend(labels=["Reference", "MLR model", "DNN model"], title="Plot Labels", title_fontsize="8", loc=2, bbox_to_anchor=(1, 1))
plt.show()
Figure 17 demonstrates the prediction performance of both of our models. Here, we will call the black line the reference line; green is our MLR while red is our DNN. The y axis represents predictions made by the model, while the x axis indicates the true value (essentially, the closer a point is to the reference, the better the model's prediction performance). Notice that the scale is between 0 and 1, so 0.03 - 0.10 most likely represents planets around 3/4 to 4 times the size of Earth. Each dot represents the model predicting an exoplanet under its corresponding estimated radius.
From both plots, it is evident that our DNN outperforms our MLR by a large margin. Our MLR model tends to overpredict, while making predictions over a larger range than our DNN. This is further supported by our KDE plots: the irregular rings on both models depict the models' primary prediction fields. In Fig. 17b, our DNN makes predictions centred around the reference, while our MLR places its fields above the reference. Interestingly, our DNN manages to find patterns when predicting our outliers (in astronomy, and given our dataset, the ability to predict larger, harder-to-detect planets makes our model more reliable and accurate).
Fig. 17a better demonstrates larger-planet prediction for both models. Compared to our DNN, the MLR makes frequent large-planet predictions despite such planets being quite rare in our dataset. In stark contrast, our DNN focuses its predictions towards the Kepler cluster, while still making the occasional prediction for larger planets. However, both models clearly struggle with such predictions; Fig. 17a highlights an issue in which the models make radical predictions for larger planets. While most predictions of large planets are fairly accurate, this is a limitation of the dataset that was used, which rarely captured extremely large exoplanets.
Overall, the above plots serve as strong evidence of our DNN outperforming our MLR. To our surprise, the DNN makes fairly accurate predictions for most of the outlier exoplanets in our dataset; however, it does tend to underestimate as the target feature increases.
dnn_error = test_predictions - test_labels
mlr_error = residuals_reduced["predicted"] - residuals_reduced["actual"]
mlr_error_reshaped = mlr_error.values.reshape(-1, 1)
mlr_norm = MinMaxScaler().fit_transform(mlr_error_reshaped)
sns.set_theme(style="darkgrid")
figure(figsize=(11, 4), dpi=400)
plt.title("Fig. 18: Residual Performance over Prediction Models")
# sns.distplot is deprecated; histplot with a KDE overlay is the modern equivalent
sns.histplot(dnn_error, kde=True, stat='density', label="DNN Model")
sns.histplot((0.68 - mlr_norm).ravel(), kde=True, stat='density', label="MLR Model")
plt.xlabel('Residual Performance [Planet Radius]')
_ = plt.ylabel('Relative Frequency of Occurrence')
plt.legend()
plt.show()
Figure 18 displays residual performance for both prediction models. The blue curve indicates our DNN, while the orange is our MLR.
The plot clearly depicts our DNN containing a high frequency of near-zero residuals, demonstrating excellent residual performance. This is in direct contrast to the MLR, which shows a wider residual curve, often indicative of lower accuracy. It is also interesting to note the tails of both distributions. Our DNN curve shows an inflated left tail, which may indicate underestimation for larger planets. Meanwhile, the MLR shows a noticeable bump on its right tail, which explains the extreme overestimation shown in Fig. 17b.
Overall, the residual performance graph signifies our fine-tuned DNN having a much more accurate performance than our MLR.
This discussion will aim to inform the reader about the neural network that was used, its structure, the fine tuning process and other relevant decisions. Performance and limitations of the network will also be discussed.
Our deep neural network (DNN) made use of the following parameters (recoverable from the code above):
Number of input nodes: 6
Hidden layers: 2, each with 24 nodes and a relu activation, each followed by a dropout layer (rates 0.015 and 0.0)
Output layer: 1 node with a linear activation
Compilation process: Adam optimizer (learning rate 0.0007), mean absolute error loss, mean squared error metric
Model training: 150 epochs with a validation split of 0.2
The decision to use these parameters is discussed in the process section.
Our network contains a total of 55 nodes and 744 edges. Here is the structure:
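As a quick sanity check on the node and edge counts quoted above (a minimal sketch):

```python
# Input layer, two hidden layers, output layer
layer_sizes = [6, 24, 24, 1]
nodes = sum(layer_sizes)
# Edges between fully-connected layers: product of adjacent layer sizes
edges = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
print(nodes, edges)  # 55 744
```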
As seen above, our DNN contains 6 input nodes followed by two hidden layers. Initially, we passed in 13 features, making our network quite complex. This took a toll on accuracy, and the network began to make quite radical predictions, resulting in harsher, jagged lines for our loss graph, as shown below:
After this, we realised we were passing in columns that had little bearing on our target feature, so we decided to prune the number of features down to 6. This drastically stabilised our loss graph:
Adding dropout between the hidden layers also helped in making predictions more reliable. However, a high dropout rate resulted in the two curves merging together and warping in unexpected ways (they would become more jagged, for example). This led us to a rate of 0.015, which gave us better prediction performance.
In terms of the amount of hidden layers, we initially experimented with 3 layers:
And even 5:
Even though we achieved very good learning using 3 (or even 5) layers, it overcomplicated the network by a large factor. First of all, the network failed to genuinely learn the data and instead became very good at memorising the training data. This led to very distant endpoints for both of our curves, which we deemed undesirable. We decided on 2 hidden layers as it simplified the network (reduced jagged curves) and allowed our val_loss and loss curves to stick together more tightly. This resulted in our DNN performing better on unseen data compared to a more complex model.
For the number of nodes, we initially tested out 64 nodes. After further experimentation, we doubled that value to 128. When simplifying our network however, we decided that an increase in nodes negatively affected the stability of our entire network, and resulted in a greater variance in predictions. After adjustments, we found 24 nodes gave us a nice middle ground between stability and prediction performance.
As our problem is a regression, we used a single-node output with a linear activation function. The hidden layers used ReLU as their activation function, which allowed the network to better learn the complex relationships within our data for this problem type. (3)
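For reference, ReLU is simple enough to state in one line:

```python
def relu(x):
    # ReLU: pass positive inputs through unchanged, zero out negatives
    return max(0.0, x)

print([relu(v) for v in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 3.0]
```

Its piecewise-linear shape is what lets stacks of Dense layers approximate the non-linear relationships in the data while remaining cheap to train.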
During the compilation process, an Adam optimizer was chosen over SGD. While SGD worked desirably with our dataset, we found Adam to optimize the data in a more predictable fashion. As "[Adam] combines the advantages of [...] AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives" (4), we found Adam to better predict larger planets (outliers) more precisely when compared to SGD. It also improved the overall prediction performance of our NN by predicting exoplanets closer to the reference line in fig. 17.
Epochs played a great role in the way in which our dnn learned. A high number of 500 epochs was initially tested and yielded the following results:
Such a high epoch count encouraged our network to memorise the training data rather than learn from it. This explains why the network progressively performs worse on data it has not seen before (error rate increases) compared with the error rate on data it has learned from (error rate decreases). After experimenting with 100 epochs, we realised the learning was cut off too early, and therefore settled on 150 epochs. This number consistently produced a reliable loss graph over multiple trials.
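In Keras this trade-off is often automated with the tf.keras.callbacks.EarlyStopping callback rather than hand-picking an epoch count. A dependency-free sketch of the stopping rule it implements (the helper name and patience value are our own):

```python
def early_stop_epoch(val_losses, patience=10):
    """Epoch at which training would halt if val_loss failed to
    improve for `patience` consecutive epochs (hypothetical helper)."""
    best, best_epoch = float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# val_loss plateaus after epoch 1, so with patience=2 training stops at epoch 3
print(early_stop_epoch([1.0, 0.8, 0.9, 0.9, 0.9], patience=2))  # 3
```

Restoring the best weights at the stopping point (EarlyStopping's restore_best_weights=True) would give the same guard against memorisation that we achieved by tuning the epoch count manually.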
This concludes the discussion portion of our deep neural network. Overall, we discussed the hyperparameters used, the DNN topology and structure, and the process behind tuning it. Additional tuning graphs were provided to complement the discussion.
Critique & Limitations ¶The primary underlying flaw of this two-phase report lies within our dataset. Historically, large clusters of planets are discovered by missions that span a specific period of time on a specific instrument. One of the instruments used to create our dataset was the Kepler Space Telescope. Kepler's primary mission was to discover Earth-like planets that may harbour extraterrestrial life. As a result, its discoveries were skewed towards smaller Earth-like planets rather than larger planets (which are often uninhabitable). When graphing our target feature, this generated two dissimilar peaks that both of our models had trouble learning across. Our DNN performed noticeably better than our MLR in such cases.
It is also worth mentioning that our dataset may not be truly representative of all exoplanets in the universe. Given Kepler's mission, and given that discoverability depends on certain features, our dataset may mostly contain exoplanets that were more easily discoverable. Rogue planets with no star, or planets orbiting a very dim or smaller star, may actually be far more common in the universe than the regular planets within our dataset, but their existence is simply harder to record. These planets may therefore seem much rarer to us even if they are more common in the universe.
Unfortunately, our MLR model possesses some notable weaknesses in terms of prediction accuracy. As mentioned before, our dataset contains two irregular peaks, corresponding to smaller planets and larger exoplanets. Our MLR model tries to merge these two peaks into a single peak, and in doing so produces a large number of predictions that overfit the data. This is illustrated more clearly by Figure 11. The residuals for our MLR also fail to obey the four main rules for residuals. Firstly, as our MLR struggles with our two-peaked dataset, the residuals do not produce a normally distributed graph. Secondly, the residuals are also inconsistent (in terms of their distribution) as planet radius grows. However, where the MLR has its weaknesses, our DNN proves its strengths.
In terms of strengths, our DNN model can create predictions of our target with high accuracy. Unlike our MLR model, the DNN is not disturbed by the two different peaks in our dataset and accurately recognises trends, especially for smaller exoplanets (Fig. 17). While our model sometimes underestimates the radius of larger planets, it still mostly makes fairly accurate predictions. Another notable strength is the clustering of its predictions. Our DNN prefers to cluster its predictions around the reference line, making them far more accurate in comparison to the massive spread of predictions produced by the MLR.
One overlooked advantage of our MLR model is that it tends to produce a more linear prediction field for larger values (Fig. 17a) than our DNN. While our DNN is quite competitive (even for larger exoplanets), our MLR may actually perform better when faced with larger exoplanet predictions.
Summary & Conclusions ¶The summary for Phase 1 will be described in the following steps:
The summary for Phase 2 will be described in the following steps:
Our R-squared value for our reduced model with 13 variables is 0.427. In this moderate-strength model, semi_major_axis, orbital_period, star_mass, star_radius and distance play the largest role in predicting planet_radius. We predicted that as distance increases, planet radius would increase, but our findings contradict this. We do not know whether this would be the case if our dataset were unimodal. Semi_major_axis (the average distance of an exoplanet from its star) was the strongest indicator of planet size. Star radius and star mass were also highly weighted, which matches our prediction that larger stars would provide more detectable exoplanets.
Diagnostic checks indicated that there was slight banding for planets below a radius of 10, and a bimodal residual distribution. This may be the cause of the lack of MLR strength. The significantly higher accuracy of our neural network suggests that there are much better ways of predicting planet_radius than MLR with our chosen dataset features. This model accurately estimates the radius of exoplanets discovered by Kepler, whereas the MLR model overestimates these planets.
The residuals of the neural network exhibit a much more normal, and tighter distribution around 0, indicating that the neural network much more accurately predicts planet_radius. The connections within the neural network are not easily parsable, so while we cannot draw direct conclusions from these relationships like with MLR, we still have a highly accurate model.
Our objective was to predict the radius of a discovered exoplanet, and explore the features which most strongly affect this prediction. Due to the nature of our dataset, our MLR model was not accurate enough to fully explore this relationship; however, some features were highly significant. We discovered significant relationships between planet radius, orbital distance and star size, despite the difficulties of the bimodal model.
We can accurately predict planet radius using the Deep Neural Network to a much higher degree of accuracy compared to our reduced MLR model. This model has also been confirmed to not be overtrained to our dataset, and as such it may be suitable for new exoplanet data, and for predicting the presence of exoplanets in unsurveyed systems.
References ¶(1) Gervais, N. (2017, November 15). How to understand loss acc val_loss val_acc in Keras model fitting. Stack Overflow. Retrieved October 20, 2021, from https://stackoverflow.com/questions/47299624/how-to-understand-loss-acc-val-loss-val-acc-in-keras-model-fitting
(2) Wu, S. (2021, June 5). 3 Best metrics to evaluate Regression Model? Towards Data Science. Retrieved October 21, 2021, from https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b
(3) Brownlee, J. (2020, August 20). A Gentle Introduction to the Rectified Linear Unit (ReLU). Machine Learning Mastery. Retrieved October 21, 2021, from https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
(4) Kingma, D. P., & Ba, J. (2014, December 22). Adam: A Method for Stochastic Optimization. arXiv. Retrieved October 23, 2021, from https://arxiv.org/abs/1412.6980